10 research outputs found

    Students taught by multimodal teachers are superior action recognizers

    Full text link
    The focal point of egocentric video understanding is modelling hand-object interactions. Standard models, e.g. CNNs or Vision Transformers, which receive RGB frames as input perform well; however, their performance improves further when additional modalities, such as object detections, optical flow, or audio, are used as input. The added complexity of the required modality-specific modules, on the other hand, makes these models impractical for deployment. The goal of this work is to retain the performance of such multimodal approaches, while using only the RGB images as input at inference time. Our approach is based on multimodal knowledge distillation, featuring a multimodal teacher (in the current experiments trained using only object detections, optical flow and RGB frames) and a unimodal student (using only RGB frames as input). We present preliminary results which demonstrate that the resulting model, distilled from a multimodal teacher, significantly outperforms the baseline RGB model (trained without knowledge distillation), as well as an omnivorous version of itself (trained on all modalities jointly), in both standard and compositional action recognition. Comment: Extended abstract accepted at the 2nd Ego4D Workshop @ ECCV 2022
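    The distillation setup described above follows the standard teacher-student recipe, so a minimal sketch may help make it concrete: the unimodal RGB student is trained against both the ground-truth action labels and the soft predictions of a frozen multimodal teacher. The function below is an illustrative PyTorch sketch, not the authors' implementation; the temperature and weighting values are assumed defaults, not numbers reported in the abstract.

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature=2.0, alpha=0.5):
        # Supervised term on the ground-truth action labels.
        ce = F.cross_entropy(student_logits, labels)
        # Soft-target term: match the teacher's tempered class distribution.
        kl = F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
        # Weighted combination; the frozen teacher receives no gradients.
        return alpha * ce + (1.0 - alpha) * kl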

    Linking Surface Facts to Large-Scale Knowledge Graphs

    Full text link
    Open Information Extraction (OIE) methods extract facts from natural language text in the form of ("subject"; "relation"; "object") triples. These facts are, however, merely surface forms, the ambiguity of which impedes their downstream usage; e.g., the surface phrase "Michael Jordan" may refer to either the former basketball player or the university professor. Knowledge Graphs (KGs), on the other hand, contain facts in a canonical (i.e., unambiguous) form, but their coverage is limited by a static schema (i.e., a fixed set of entities and predicates). To bridge this gap, we need the best of both worlds: (i) the high coverage of free-text OIEs, and (ii) the semantic precision (i.e., monosemy) of KGs. In order to achieve this goal, we propose a new benchmark with novel evaluation protocols that can, for example, measure fact-linking performance at the level of individual triple slots, while also measuring whether a system can recognize that a surface form has no match in the existing KG. Our extensive evaluation of several baselines shows that detection of out-of-KG entities and predicates is more difficult than accurate linking to existing ones, thus calling for more research effort on this task. We publicly release all resources (data, benchmark and code) on https://github.com/nec-research/fact-linking
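    As an illustration of the slot-level evaluation mentioned above, the sketch below scores linked triples per slot (subject, relation, object) and treats "no match in the KG" as its own prediction, here encoded as None. This is a hypothetical convention and a simplified scorer, not the released benchmark code; the KG identifiers in the example are invented.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class LinkedTriple:
        subject: Optional[str]   # KG entity id, or None if out-of-KG
        relation: Optional[str]  # KG predicate id, or None if out-of-KG
        obj: Optional[str]       # KG entity id, or None if out-of-KG

    def slot_accuracy(predicted, gold):
        # A slot counts as correct only if the predicted KG id, or the
        # out-of-KG decision (None), matches the gold annotation exactly.
        slots = ("subject", "relation", "obj")
        correct = {s: 0 for s in slots}
        for p, g in zip(predicted, gold):
            for s in slots:
                correct[s] += int(getattr(p, s) == getattr(g, s))
        return {s: correct[s] / len(gold) for s in slots}

    # ("Michael Jordan"; "teaches at"; "Berkeley") with invented KG ids,
    # where the relation is judged to have no KG counterpart.
    pred = [LinkedTriple("ent:MichaelJordan_professor", None, "ent:UC_Berkeley")]
    gold = [LinkedTriple("ent:MichaelJordan_professor", None, "ent:UC_Berkeley")]
    print(slot_accuracy(pred, gold))  # {'subject': 1.0, 'relation': 1.0, 'obj': 1.0}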

    Decoding language spatial relations to 2D spatial arrangements

    No full text
    status: published

    Learning to ground medical text in a 3D human atlas

    No full text
    status: accepted

    Multimodal Distillation for Egocentric Action Recognition

    Full text link
    The focal point of egocentric video understanding is modelling hand-object interactions. Standard models, e.g. CNNs or Vision Transformers, which receive RGB frames as input perform well. However, their performance improves further by employing additional input modalities that provide complementary cues, such as object detections, optical flow, or audio. The added complexity of the modality-specific modules, on the other hand, makes these models impractical for deployment. The goal of this work is to retain the performance of such a multimodal approach, while using only the RGB frames as input at inference time. We demonstrate that for egocentric action recognition on the Epic-Kitchens and the Something-Something datasets, students that are taught by multimodal teachers tend to be more accurate and better calibrated than architecturally equivalent models trained on ground-truth labels in a unimodal or multimodal fashion. We further adopt a principled multimodal knowledge distillation framework, which allows us to deal with issues that occur when multimodal knowledge distillation is applied naively. Lastly, we demonstrate the achieved reduction in computational complexity, and show that our approach maintains higher performance as the number of input views is reduced. We release our code at https://github.com/gorjanradevski/multimodal-distillation. Comment: Accepted at ICCV 2023; codebase released at https://github.com/gorjanradevski/multimodal-distillation
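    The claim that distilled students are "better calibrated" is commonly quantified with the expected calibration error (ECE). Below is a generic, self-contained estimator of that quantity, included only to make the claim concrete; it is not the evaluation code from the released repository, and the bin count is an arbitrary choice.

    import numpy as np

    def expected_calibration_error(confidences, predictions, labels, n_bins=15):
        # Weighted average of |accuracy - confidence| over equal-width confidence bins.
        confidences = np.asarray(confidences, dtype=float)
        predictions = np.asarray(predictions)
        labels = np.asarray(labels)
        edges = np.linspace(0.0, 1.0, n_bins + 1)
        ece = 0.0
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = (confidences > lo) & (confidences <= hi)
            if not in_bin.any():
                continue
            bin_accuracy = (predictions[in_bin] == labels[in_bin]).mean()
            bin_confidence = confidences[in_bin].mean()
            ece += in_bin.mean() * abs(bin_accuracy - bin_confidence)
        return ece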

    Cohort-derived machine learning models for individual prediction of chronic kidney disease in people living with HIV: a prospective multicentre cohort study.

    Get PDF
    BACKGROUND It is unclear whether data-driven machine learning models, which are trained on large epidemiological cohorts, may improve prediction of co-morbidities in people living with HIV. METHODS In this proof-of-concept study, we included people living with HIV from the prospective Swiss HIV Cohort Study with a first estimated glomerular filtration rate (eGFR) >60 ml/min/1.73 m2 after January 1, 2002. Our primary outcome was chronic kidney disease (CKD), defined as a confirmed decrease in eGFR to ≤60 ml/min/1.73 m2 on measurements at least three months apart. We split the cohort data into a training set (80%), validation set (10%), and test set (10%), stratified for CKD status and follow-up length. RESULTS Of 12,761 eligible individuals (median baseline eGFR, 103 ml/min/1.73 m2), 1,192 (9%) developed CKD after a median of eight years. We used 64 static and 502 time-changing variables. Across prediction horizons and algorithms, and in contrast to expert-based standard models, most machine learning models achieved state-of-the-art predictive performance, with areas under the receiver operating characteristic curve and the precision-recall curve ranging from 0.926 to 0.996 and from 0.631 to 0.956, respectively. CONCLUSIONS In people living with HIV, we observed state-of-the-art performance in forecasting individual CKD onset with different machine learning algorithms.
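    To make the reported evaluation protocol concrete, the sketch below reproduces its shape with scikit-learn on synthetic stand-in data: an 80/10/10 split stratified on the outcome (a simplification; the study additionally stratified on follow-up length) and the two reported metrics, area under the ROC curve and under the precision-recall curve. The classifier, feature matrix, and event rate are placeholders; only the split proportions and the metrics come from the abstract.

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.metrics import roc_auc_score, average_precision_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))            # stand-in for the static and time-changing variables
    y = (rng.random(1000) < 0.09).astype(int)  # ~9% event rate, as in the cohort

    # 80% train, 10% validation, 10% test, stratified on the CKD label.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

    model = GradientBoostingClassifier().fit(X_train, y_train)
    scores = model.predict_proba(X_test)[:, 1]
    print("AUROC:", roc_auc_score(y_test, scores))
    print("AUPRC:", average_precision_score(y_test, scores))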

    A multi-modal AI approach for intuitively instructable autonomous systems

    No full text
    Abstract: We present a multi-modal AI framework to intuitively instruct and control Automated Guided Vehicles. We define a general multi-modal AI architecture with loose coupling between three different AI modules: spoken language understanding, visual perception, and Reinforcement Learning navigation. We use the same multi-modal architecture for two different use cases implemented on two different platforms: an off-road vehicle that can pick up objects, and an indoor forklift that performs automated warehouse inventory. We show how the proposed architecture can be used for a wide range of tasks and can be implemented on different hardware, demonstrating a high degree of modularity.
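    As a rough illustration of the loose coupling described above, the sketch below wires three independently replaceable modules behind narrow interfaces. The class names, method signatures, and dictionary formats are invented for this example and are not the framework's actual API.

    from typing import Protocol

    class LanguageUnderstanding(Protocol):
        def parse(self, utterance: str) -> dict: ...  # e.g. {"intent": "pick", "target": "pallet"}

    class Perception(Protocol):
        def detect(self, frame) -> list[dict]: ...    # detected objects with poses

    class Navigation(Protocol):
        def plan(self, goal: dict, detections: list[dict]) -> list[str]: ...  # action sequence

    class Controller:
        """Glue layer: swapping one module implementation (e.g. off-road
        vehicle vs. indoor forklift) leaves the other two untouched."""

        def __init__(self, slu: LanguageUnderstanding, vision: Perception, nav: Navigation):
            self.slu, self.vision, self.nav = slu, vision, nav

        def step(self, utterance: str, frame) -> list[str]:
            goal = self.slu.parse(utterance)        # spoken language understanding
            detections = self.vision.detect(frame)  # visual perception
            return self.nav.plan(goal, detections)  # RL navigation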